# load libraries
library(knitr)
library(tidyverse)8 Data Transformation:
Learning Objectives
After completing this lab you should be able to
- use
mutate()to add new columns. - use
group_by()to group rows. - use
summarize()to calculate summary statistics.
For each of our modules we will have a project-folder with an Rproject, *.qmd-files, and sub-directories for data, scripts, and results as described in our Rproject Tutorial. You should have a directory on your Desktop or Documents folder on your laptop (name it something like bi349) as a home directory for all of our project folders this semester.
You should have already downloaded the project directory for this module, make sure the directory is unzipped and move it to your bi328 directory. You can open the Rproj for this module either by double clicking on it which will launch Rstudio or by opening Rstudio and then using File > Open Project or by clicking on the Rproject icon in the top right of your program window and selecting Open Project.
Once you have opened a project you should see the project name in the top right corner1.
1 Pro tip: If you run into issues where a quarto document won’t render or file paths aren’t working (especially if things were working previously) one of your first steps should be to double check that the correct Rproj is loaded.
There should be a document names 08_data-transformation-ii.qmd in your project directory. Use that file to work through this tutorial - you will hand in your rendered (“knitted”) quarto file as your homework assignment. So, first thing in the YAML header, change the author to your name. You will use this quarto document to record your answers. Remember to use comments to annotate your code; at minimum you should have one comment per code set2 you may of course add as many comments as you need to be able to recall what you did]. Similarly, take notes in the document as we discuss discussion/reflection questions but make sure that you go back and clean them up for “public consumption”.
2 You should do this whether you are adding code yourself or using code from our manual, even if it isn’t commented in the manual… especially when the code is already included for you, add comments to describe how the function works/what it does as we introduce it during the participatory coding session so you can refer back to it.
Let’s start by loading the packages we will need for this activity.
And we will want to make sure that we have read in our data set as a dataframe.
# read catch data
catch <- read_delim("data/longline_catchdata.txt", delim = "\t")8.1 Adding new variables
So,turns out selecting columns and filtering based on content in rows is pretty straightforward.
But frequently when we are processing our raw data sets we end up wanting to compute additional metrics or use the existing raw data to create new categories.
The function mutate() can be used to create new columns. Frequently, this is done based on columns already existing in the data frame. This is a very powerful function with endless possibilities, but we are going to stick to some of the basics for now3.
3 Rest assured if your answer is “Oh, could I …” the answer is “Yes”.
Let’s say you wanted create a column that contained the difference between the fork length and the stretch total length4:
4 By default mutate() appends (adds) the new column as the last column. So we can see our results better we’ll used select() to move it to be the first column in the dataframe)
catch %>%
mutate(difference = STL - FL) %>%
select(difference, everything())# A tibble: 2,325 × 13
difference Site Species Sex Observed_Stage PCL FL STL Hook_Size
<dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 66 Aransas_… Bagre_… U <NA> NA 287 353 10
2 70 Aransas_… Bagre_… U <NA> NA 425 495 10
3 86 Aransas_… Bagre_… U <NA> NA 416 502 15
4 91 Aransas_… Bagre_… U <NA> NA 416 507 10
5 92 Aransas_… Bagre_… U <NA> NA 418 510 15
6 81 Aransas_… Bagre_… U <NA> NA 434 515 10
7 93 Aransas_… Bagre_… U <NA> NA 427 520 15
8 86 Aransas_… Bagre_… U <NA> NA 446 532 10
9 73 Aransas_… Bagre_… U <NA> NA 465 538 10
10 89 Aransas_… Bagre_… U <NA> NA 450 539 10
# ℹ 2,315 more rows
# ℹ 4 more variables: Set <dbl>, Day <dbl>, Month <dbl>, Year <dbl>
You should now have a column called difference at the end of the data frame5.
5 Instead of - to substract, you can other mathematical operators such as + to add , * to multiple, and / to divide values when creating a new column.
You can also create a column that contains a logical value (TRUE/FALSE). For example we might need a column that indicates if the Sex is unknown.
catch %>%
mutate(unknown_sex = Sex == "U") %>%
select(unknown_sex, everything())# A tibble: 2,325 × 13
unknown_sex Site Species Sex Observed_Stage PCL FL STL Hook_Size
<lgl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 TRUE Aransas… Bagre_… U <NA> NA 287 353 10
2 TRUE Aransas… Bagre_… U <NA> NA 425 495 10
3 TRUE Aransas… Bagre_… U <NA> NA 416 502 15
4 TRUE Aransas… Bagre_… U <NA> NA 416 507 10
5 TRUE Aransas… Bagre_… U <NA> NA 418 510 15
6 TRUE Aransas… Bagre_… U <NA> NA 434 515 10
7 TRUE Aransas… Bagre_… U <NA> NA 427 520 15
8 TRUE Aransas… Bagre_… U <NA> NA 446 532 10
9 TRUE Aransas… Bagre_… U <NA> NA 465 538 10
10 TRUE Aransas… Bagre_… U <NA> NA 450 539 10
# ℹ 2,315 more rows
# ℹ 4 more variables: Set <dbl>, Day <dbl>, Month <dbl>, Year <dbl>
You should know have a column called unknown_sex where if the animal that was caught was not sexed contains the value TRUE, if it was identified as male or female it would say FALSE.
For that last problem, a “conditional mutate” using an ifelse statement (if this then do that, else do that) could have come in handy. Another option is case_when() which allows you to create multiple sets of conditions as opposed to ifelse which sets up a TRUE/FALSE dichotomy (file this information away for “maybe useful later”).
8.2 group_by() and mutate()
Many problems in data science require you to split your data set into subsets according to some grouping variable, apply a function, and then combine the results. dplyr is designed to make this straightforward; you have already sen an example of this while you were learning about filter().
Similarly, you can combine mutate() with group_by().
The function mean() will calculate the mean value of a vector of numbers, the argument na.rm=TRUE tells the function to ignore any NA-values in the data set.
For example, let’s say you wanted to create a column that is the difference between the fork length of an individual and the mean fork length of that species.
catch %>%
group_by(Species) %>%
mutate(diff_mean = FL-mean(FL, na.rm = TRUE))# A tibble: 2,325 × 13
# Groups: Species [14]
Site Species Sex Observed_Stage PCL FL STL Hook_Size Set Day
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aransas… Bagre_… U <NA> NA 287 353 10 1 28
2 Aransas… Bagre_… U <NA> NA 425 495 10 1 28
3 Aransas… Bagre_… U <NA> NA 416 502 15 1 28
4 Aransas… Bagre_… U <NA> NA 416 507 10 1 28
5 Aransas… Bagre_… U <NA> NA 418 510 15 1 28
6 Aransas… Bagre_… U <NA> NA 434 515 10 1 28
7 Aransas… Bagre_… U <NA> NA 427 520 15 1 28
8 Aransas… Bagre_… U <NA> NA 446 532 10 1 28
9 Aransas… Bagre_… U <NA> NA 465 538 10 1 28
10 Aransas… Bagre_… U <NA> NA 450 539 10 1 28
# ℹ 2,315 more rows
# ℹ 3 more variables: Month <dbl>, Year <dbl>, diff_mean <dbl>
8.3 Create new data.frame based on another
Not infrequently we are more interested in summary (descriptive) stats of a data set rather than all the raw data - Tidyverse got you covered with the function summarize().
For example, we might want to calculate the mean and standard deviation of the measured fork length.
catch %>%
summarize(mean_FL = mean(FL, na.rm = TRUE),
sd_FL = sd(FL, na.rm = TRUE))# A tibble: 1 × 2
mean_FL sd_FL
<dbl> <dbl>
1 406. 103.
Remember, that earlier we’ve have used the function max() to obtain the largest value in a vector.
That’s cool but really we could have also just used
mean(catch$FL, na.rm = TRUE)[1] 405.9179
max(catch$FL, na.rm = TRUE)[1] 1140
to get that information, since we are only interested in one column (vector).
summarize() becomes especially powerful once we leverage group_by() to start calculating summary stats for entries grouped by a grouping variable.
For example we can calculate summary stats by species and generate a table to include in a report.
catch %>%
group_by(Species) %>%
summarize(mean_FL = mean(FL, na.rm = TRUE),
median_FL = median(FL, na.rm = TRUE),
max_FL = max(FL, na.rm = TRUE),
min_FL = min(FL, na.rm = TRUE),
sd_FL = sd(FL, na.rm = TRUE)) %>%
ungroup()# A tibble: 14 × 6
Species mean_FL median_FL max_FL min_FL sd_FL
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Bagre_marinus 433. 445 575 45 65.4
2 Carcharhinus_brevipinna 644. 648 900 489 69.2
3 Carcharhinus_leucas 769 702 1140 624 167.
4 Carcharhinus_limbatus 613. 579 757 538 101.
5 Carcharhinus_porosus 415 415 415 415 NA
6 Hypanus_americanus NaN NA -Inf Inf NA
7 Hypanus_sabina NaN NA -Inf Inf NA
8 Rhinoptera_bonasus NaN NA -Inf Inf NA
9 Rhizoprionodon_terraenovae 412 396 637 306 73.6
10 Sciades_felis 299. 297 480 102 41.9
11 Sciaenops_ocellatus 793 793 841 745 67.9
12 Sphyrna_lewini 471. 548. 578 210 174.
13 Sphyrna_tiburo 622. 605 861 370 114.
14 Synodus_foetens 173 173 173 173 NA
So far, we have been manipulating our data frame using code and printing it directly to the console (and our quarto document). This can be useful for example to generate tables for reports. However, in many cases we want to create a new object that has been manipulated according to our code and then we will further process, visualize, or analyze that dataframe down the line.
summary <- catch %>%
group_by(Species) %>%
summarize(mean_FL = mean(FL, na.rm = TRUE),
median_FL = median(FL, na.rm = TRUE),
max_FL = max(FL, na.rm = TRUE),
min_FL = min(FL, na.rm = TRUE),
sd_FL = sd(FL, na.rm = TRUE)) %>%
ungroup()When you execute this code, you’ll notice that the code (and probably a warning) is printed to the console but there is no output. Instead, if you look at the environment in the bottom left panel you should now see a new object called summary. Per usual, you can pull that up in the Editor/View pane (top left) using either View(summary) in the console or by clicking on the object in the environment.
You will be presenting results in reports over the course of the semester, when you knit an quarto file you will get tables formatted in a standard way according to defaults in the resulting html file. If you want finer control over the output, you can use the kable() function. This will allow you to further format the table, for example, you may specify the number of digits printed using the argument digits =.
By adding a chunk options for a label as #| label: tbl-sum-stats and table caption as #| tbl-cap: "Summary statistics for the forklength of each species, you can further modify the output that adheres to typical reporting standards for reports and research articles.
kable(
summary,
digits = 1
)| Species | mean_FL | median_FL | max_FL | min_FL | sd_FL |
|---|---|---|---|---|---|
| Bagre_marinus | 433.4 | 445.0 | 575 | 45 | 65.4 |
| Carcharhinus_brevipinna | 643.7 | 648.0 | 900 | 489 | 69.2 |
| Carcharhinus_leucas | 769.0 | 702.0 | 1140 | 624 | 167.3 |
| Carcharhinus_limbatus | 613.2 | 579.0 | 757 | 538 | 101.0 |
| Carcharhinus_porosus | 415.0 | 415.0 | 415 | 415 | NA |
| Hypanus_americanus | NaN | NA | -Inf | Inf | NA |
| Hypanus_sabina | NaN | NA | -Inf | Inf | NA |
| Rhinoptera_bonasus | NaN | NA | -Inf | Inf | NA |
| Rhizoprionodon_terraenovae | 412.0 | 396.0 | 637 | 306 | 73.6 |
| Sciades_felis | 298.9 | 297.0 | 480 | 102 | 41.9 |
| Sciaenops_ocellatus | 793.0 | 793.0 | 841 | 745 | 67.9 |
| Sphyrna_lewini | 470.8 | 547.5 | 578 | 210 | 174.4 |
| Sphyrna_tiburo | 621.5 | 605.0 | 861 | 370 | 114.4 |
| Synodus_foetens | 173.0 | 173.0 | 173 | 173 | NA |
8.4 Combining verbs
We’ve already combined most of our dplyr verbs with group_by().
When you are wrangling data you will find that making use of the pipe (%>%) to combine select(), filter(), mutate(), and summarize() as a series of commands will be necessary to get your data set in the correct format and further process it.
Generate the code that will manipulate the data frame as follows6:
6 some bullet points may require more than one line of code; you do not have to perform the steps in the sequence presented, play around a little bit to see how to code this more efficiently